-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-5021] [MLlib] Gaussian Mixture now supports Sparse Input #4459
Conversation
ping @tgaloppo and @jkbradley (whenever you are back!) |
Test build #27049 has started for PR 4459 at commit
|
|
||
import org.apache.spark.annotation.Experimental | ||
import org.apache.spark.mllib.linalg.{BLAS, DenseMatrix, DenseVector, Matrices, Vector, Vectors} | ||
import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors, DenseVector, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alphabetize these
Test build #27049 has finished for PR 4459 at commit
|
Test PASSed. |
@tgaloppo Thanks. I shall fix them in a while. However, does the general code look good to you? |
Test build #27087 has started for PR 4459 at commit
|
Test build #27087 has finished for PR 4459 at commit
|
Test PASSed. |
def syr(alpha: Double, x: SparseVector, A: DenseMatrix) { | ||
val mA = A.numRows | ||
val nA = A.numCols | ||
require(mA == nA, s"A is not a square matrix. A: $mA x $nA") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Symmetry is a requirement of the syr() function; what is being performed is an incomplete check. The old message mismatched the check, but matched the requirement; perhaps this could be something like "A is not square, and thus cannot be symmetric" ... or something along that line?
@tgaloppo Thanks for your valuable feedback. Do you have anything more to add as of now? |
@MechCoder Nothing else stands out to me... I will give it another look after your next commit. |
Test build #27107 has started for PR 4459 at commit
|
@tgaloppo I fixed it up. Can you have a look? |
Test build #27108 has started for PR 4459 at commit
|
Also a noob question, but what is the significance of the negative variance in the tests? |
Test build #27107 has finished for PR 4459 at commit
|
Test FAILed. |
Test build #27108 has finished for PR 4459 at commit
|
Test PASSed. |
@MechCoder Do you mean the negative values in the covariance (sigma) matrices? Negative covariance indicates, roughly speaking, that variables move in opposite directions... as one increases, the other decreases (positive covariance indicate they move together, and zero covariance indicate they move independently). The covariance matrix has the variance of the individual vector components on the diagonal, and the covariance between vector components off the diagonal (and is symmetric). Does this help?? |
val gmm = new GaussianMixture().setK(1).setSeed(seed).run(data) | ||
assert(gmm.weights(0) ~== Ew absTol 1E-5) | ||
assert(gmm.gaussians(0).mu ~== Emu absTol 1E-5) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should check covariance matrix here. The Esigma given here does not look right. If you need help computing it, let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My mistake; I am sorry.
@MechCoder Getting close; just need to finish up the sparse single cluster test. |
Test build #27171 has started for PR 4459 at commit
|
Test PASSed. |
@tgaloppo Alright, thanks for the explanation. Can I know, what makes you think that the covariance matrix is wrong? I calculated it manually and it seems to be right. I added the test and it also passes. |
0679440
to
5cb370b
Compare
Test build #27179 has started for PR 4459 at commit
|
Test build #27179 has finished for PR 4459 at commit
|
Test PASSed. |
LGTM cc: @jkbradley @mengxr |
@mengxr Just to make it easier for you, a small description. GaussianMixture used to support sparse input, by converting it to DenseVectors, which is non-optimal, in case of a large number of zeros. This PR makes sure that the model works without this conversion happening. |
@@ -235,12 +235,23 @@ private[spark] object BLAS extends Serializable with Logging { | |||
* @param x the vector x that contains the n elements. | |||
* @param A the symmetric matrix A. Size of n x n. | |||
*/ | |||
def syr(alpha: Double, x: DenseVector, A: DenseMatrix) { | |||
|
|||
// Wrapper around the dense and sparse methods. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not part of the JavaDoc. We can either move it to the JavaDoc or to the method closure.
@mengxr Fixed up your comments. Let me know if there is anything else. |
Test build #27230 has started for PR 4459 at commit
|
b997aa3
to
1b18dab
Compare
Test build #27231 has started for PR 4459 at commit
|
Test build #27230 has finished for PR 4459 at commit
|
Test PASSed. |
Test build #27231 has finished for PR 4459 at commit
|
Test PASSed. |
Following discussion in the Jira. Author: MechCoder <[email protected]> Closes #4459 from MechCoder/sparse_gmm and squashes the following commits: 1b18dab [MechCoder] Rewrite syr for sparse matrices e579041 [MechCoder] Add test for covariance matrix 5cb370b [MechCoder] Separate tests for sparse data 5e096bd [MechCoder] Alphabetize and correct error message e180f4c [MechCoder] [SPARK-5021] Gaussian Mixture now supports Sparse Input (cherry picked from commit fd2c032) Signed-off-by: Xiangrui Meng <[email protected]>
LGTM. Merged into master and branch-1.3. Thanks! |
Following discussion in the Jira.